Journal of Machine Learning Research () 0-0 



Submitted 0/0; Published 0/0 



An Efficient Primal Dual Prox Method for 
Non-Smooth Optimization 

Tianbao Yang yangtia1@msu.edu 

Mehrdad Mahdavi mahdavim@msu.edu 

Rong Jin rongjin@cse.msu.edu 
Department of Computer Science and Engineering 
Michigan State University, East Lansing, MI, 48824, USA 

1 Shenghuo Zhu zsh@sv.nec-labs.com 
NEC Labs America, Cupertino, CA, 95014, USA 

Sh ■ 

< 



(N 



o 



(N 

in 
o 

(N 



Editor: ? 



Abstract 



Wc study the non-smooth optimization problems in machine learning, where both the 
^ , loss function and the regularizer are non-smooth functions. Previous studies on efficient 

empirical loss minimization assume either a smooth loss function or a strongly convex 
regularizer, making them unsuitable for non-smooth optimization. We develop an efficient 
■ method for a family of non-smooth optimization where the dual form of the loss function 

C*~) . is bilinear in primal and dual variables. We cast a non-smooth optimization problem into 

a minimax optimization problem, and develop a primal dual prox method that solves the 
minimax optimization problem at a rate of 0(1 /T), significantly faster than a standard 
gradient descent method that has 0(1/ VT) convergence rate. Our empirical study verifies 
the efficiency of the proposed method for non-smooth optimization by comparing it to the 
state-of-the-art first order methods. 

Keywords: non-smooth optimization, primal dual method, convergence rate, sparsity, 
efficiency 



5_j ■ 1. Introduction 



Formulating machine learning tasks as a regularized empirical loss minimization prob- 
lem makes an intimate connection between machine learning and mathematical optimiza- 
tion. In regularized empirical loss minimization, one tries to jointly minimize an empir- 
ical loss over training samples plus a regularization term of the model. This formula- 
tion includes support vector machine (SVM) (Hastie et ah, 2008), support vector regres- 
sion (Smola and Scholkopf, 2004), Lasso (Zhu et ah, 2003), logistic regression, and ridge 
regression (Hastie et ah, 2008) among many others. Therefore, optimization methods play 
a central role in solving machine learning problems and challenges exist in machine learning 
applications demand the development of new optimization algorithms. 

Depending on the application at hand, various types of loss and regularization functions 
have been introduced in the literature. The efficiency of different optimization algorithms 
crucially depends on the specific structures of the loss and the regularization functions. 
Recently, there have been significant interests on gradient descent based methods due to 
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their simplicity and scalability to large datasets. A well-known example is the Pegasos algo- 
rithm (Shalev-Shwartz et al., 2007) which minimizes the l\ regularized hinge loss (i.e. SVM) 
and achieves a convergence rate of 0(1/T), where T is the number of iterations, by exploit- 
ing the strong convexity of the regularizer. Several other first order algorithms (Ji and Ye, 
2009; Chen et al., 2009) are also proposed for smooth loss functions (e.g., squared loss 
and logistic loss) and non-smooth regularizers (i.e., £i j00 and group lasso). They achieve a 
convergence rate of 0(1/T 2 ) by exploiting the smoothness of the loss functions. 

In this paper, we focus on a more challenging case where both the loss function and 
the regularizer are non-smooth, to which we refer to as non-smooth optimization. Non- 
smooth optimization of regularized empirical loss has found applications in many machine 
learning problems. Examples of non-smooth loss functions include hinge loss (Vapnik, 1998), 
generalized hinge loss (Bartlett and Wegkamp, 2008), absolute loss (Hastie et al., 2008), 
and e-insensitive loss (Rosasco et al., 2004); examples of non-smooth regularizers include 
lasso (Zhu et al., 2003), group lasso (Yuan and Lin, 2006), sparse group lasso (Yang et al., 
2010), exclusive lasso (Zhou et al., 2010b), ^i j00 regularizer (Quattoni et al., 2009), and 
trace norm regularizer (Rennie and Srebro, 2005). 

Although there are already many existing studies on tackling smooth loss functions 
(e.g. square loss for regression, logistic loss for classification), or smooth regularizers (e.g. 
£| norm), there are serious challenges in developing efficient algorithms for non-smooth 
optimization. In particular, common tricks, such as smoothing non-smooth objective func- 
tions (Nesterov, 2005a, b), can not be applied to non-smooth optimization to improve con- 
vergence rate. This is because they require both the loss functions and regularizers be 
written in the maximization form of bilinear functions, which unfortunately are often vio- 
lated, as we will discuss later. In this work, we focus on optimization problems in machine 
learning where both the loss function and the regularizer are non-smooth. Our goal is to 
develop an efficient gradient based algorithm that has a convergence rate of 0(1/T) for a 
wide family of non-smooth loss functions and general non-smooth regularizers. 

It is noticeable that according to the information based complexity theory (Traub et al., 
1988), it is impossible to derive an efficient first order algorithm that generally works for 
all non-smooth objective functions. As a result, we focus on a family of non-smooth opti- 
mization problems, where the dual form of the non-smooth loss function is bilinear in both 
primal and dual variables. Additionally, we show that many non-smooth loss functions have 
this bilinear dual form. We derive an efficient gradient based method, with a convergence 
rate of 0(1/T), that explicitly updates both the primal and dual variables. The proposed 
method is referred to as Primal Dual Prox (Pdprox) method. Besides its capability 
of dealing with non-smooth optimization, the proposed method is effective in handling the 
learning problems where additional constraints are introduced for dual variables. 

The rest of this paper is organized as follows. Section 2 reviews the related work on 
minimizing regularized empirical loss especially the first order methods for large-scale op- 
timization. Section 4 presents the proposed primal dual prox method, its convergence 
analysis, and several extensions of the proposed method. Section 5 presents the empirical 
studies, and Section 6 concludes this work. 
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2. Related Work 

Our work is closely related to the previous studies on regularized empirical loss minimiza- 
tion. In the following discussion, we mostly focus on non-smooth loss functions and non- 
smooth regularizers. 

Non-smooth loss functions Hinge loss is probably the most commonly used non-smooth 
loss function for classification. It is closely related to the max-margin criterion. A number 
of algorithms have been proposed to minimize the £\ regularized hinge loss (Piatt, 1998; 
Joachims, 1999, 2006; Hsieh et al., 2008; Shalev-Shwartz et al., 2007), and the £\ regularized 
hinge loss (Cai et al., 2010; Zhu et al., 2003; Fung and Mangasarian, 2002). Besides the 
hinge loss, recently a generalized hinge loss function (Bartlett and Wegkamp, 2008) has 
been proposed for cost sensitive learning. For regression, square loss is commonly used due 
to its smoothness. However, non-smooth loss functions such as absolute loss (Hastie et al., 
2008) and e-insensitive loss (Rosasco et al., 2004) are useful for robust regression. The Bayes 
optimal predictor of square loss is the mean of the predictive distribution, while the Bayes 
optimal predictor of absolute loss is the median of the predictive distribution. Therefore 
absolute loss is more robust for long-tailed error distributions and outliers (Hastie et al., 

2008) . Rosasco et al. (2004) also proved that the estimation error bound for absolute loss 
and e-insensitive loss converges faster than that of square loss. Non-smooth piecewise linear 
loss function has been used in quantile regression (Koenker, 2005; Gneiting, 2008). Unlike 
the absolute loss, the piecewise linear loss function can model non-symmetric error in reality. 

Non-smooth regularizers Besides the simple non-smooth regularizers such as £i, £2, 
and £00 norms (Duchi and Singer, 2009), many other non-smooth regularizers have been em- 
ployed in machine learning tasks. Yuan and Lin (2006) introduced group lasso for selecting 
important explanatory factors in group manner. The £i )OQ norm regularizer has been used 
for multi-task learning (Argyriou et al., 2008). In addition, several recent works (Hou et al., 
2011; Nie et al., 2010; Liu et al., 2009) considered mixed £2,1 regularizer for feature selec- 
tion. Zhou et al. (2010b) introduced exclusive lasso for multi-task feature selection to model 
the scenario where variables within a single group compete with each other. Trace norm reg- 
ularizer is another non-smooth regularizer, which has found applications in matrix comple- 
tion (Recht et al., 2010; Candes and Recht, 2008), matrix factorization (Rennie and Srebro, 
2005; Srebro et al., 2005), and multi-task learning (Argyriou et al., 2008; Ji and Ye, 2009). 
The optimization algorithms presented in these works are usually limited: either the con- 
vergence rate is not guaranteed (Argyriou et al., 2008; Recht et al., 2010; Hou et al., 2011; 
Nie et al., 2010; Rennie and Srebro, 2005; Srebro et al., 2005) or the loss functions are as- 
sumed to be smooth (e.g., the square loss or the logistic loss) (Liu et al., 2009; Ji and Ye, 

2009) . Despite the significant efforts in developing algorithms for minimizing regularized 
empirical losses, it remains a challenge to design a first order algorithm that is able to 
efficiently solve non-smooth optimization problems at a rate of 0(1/T) when both the loss 
function and the regularizer are non-smooth. 

Gradient based optimization Our work is closely related to (sub)gradient based op- 
timization methods. The convergence rate of gradient based methods usually depends on 
the properties of the objective function to be optimized. When the objective function is 
strongly convex and smooth, it is well known that gradient descent methods can achieve 
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a geometric convergence rate (Boyd and Vandenberghe, 2004). When the objective func- 
tion is smooth but not strongly convex, the optimal convergence rate of a gradient descent 
method is 0(1/T 2 ), and is achieved by the Nesterov's methods (Nesterov, 2007). For 
the objective function which is strongly convex but not smooth, the convergence rate be- 
comes 0(1/T) (Shalev-Shwartz et al., 2007). For general non-smooth objective functions, 
the optimal rate of any first order method is 0(1/VT). Although it is not improvable in 
general, recent studies are able to improve this rate to 0(1/T) by exploring the special 
structure of the objective function (Nesterov, 2005a, b). In addition, several methods are 
developed for composite optimization, where the objective function is written as a sum of 
a smooth and a non-smooth function (Lan, 2010; Nesterov, 2007; Lin, 2010). Recently, 
these optimization techniques have been successfully applied to various machine learning 
problems, such as SVM (Zhou et al., 2010a), general regularized empirical loss minimiza- 
tion (Duchi and Singer, 2009; Hu et al., 2009), trace norm minimization (Ji and Ye, 2009), 
and multi-task sparse learning (Chen et al., 2009). Despite these efforts, one major limita- 
tion of the existing (sub)gradient based algorithms is that in order to achieve a convergence 
rate better than 0(l/\/T), they have to assume that the loss function is smooth or the 
regularizer is strongly convex, making them unsuitable for non-smooth optimization. 

Convex-concave optimization The present work is also related to convex-concave min- 
imization. Tseng (2008) and Nemirovski (2005) developed prox methods that have a con- 
vergence rate of 0(1/T), provided the gradients are Lipschitz continuous and have been 
applied to machine learning problems (Sun et al., 2009). In contrast, our method achieves 
a rate of 0(1/T) without requiring gradients to be Lipschitz continuous. Several other 
primal-dual algorithms have been developed for regularized empirical loss minimization 
that update both primal and dual variables. Zhu and Chan (2008) propose a primal-dual 
method based on gradient descent, which only achieves a rate of 0(l/y/T). Mosci et al. 
(2010) present a primal-dual algorithm for group sparse regularization, which updates the 
primal variable by a prox method and the dual variable by a Newton's method. In con- 
trast, the proposed algorithm is a first order method that does not require computing 
the Hessian matrix as the Newton's method does, and is therefore more scalable to large 
datasets. Lan et al. (2011) consider the primal-dual convex formulations for general cone 
programming and apply Nesterov's optimal first order method (Nesterov, 2007), Nesterov's 
smoothing technique (Nesterov, 2005a), and Nemirovski's prox method (Nemirovski, 2005). 
Nesterov (2005b) proposed a primal dual gradient method for a special class of structured 
non-smooth optimization problems by exploring an excessive gap technique. 

Optimizing non-smooth functions We note that Nesterov's smoothing technique (Nesterov, 
2005a) and excessive gap technique (Nesterov, 2005b) can be applied to non-smooth op- 
timization and both achieve 0(1 /T) convergence rate for a special class of non-smooth 
optimization problems. However, the limitation of these approaches is that they require 
all the non-smooth terms (i.e., the loss and the regularizer) to be written as an explicit 
max structure that consists of a bilinear function in primal and dual variables, thus limits 
their applications to many machine learning problems. In addition, Nesterov's algorithms 
need to solve additional maximizations problem at each iteration. In contrast, the pro- 
posed algorithm only requires mild condition on the non-smooth loss functions (section 4), 
and allows for any commonly used non-smooth regularizers, without having to solve an 
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(a) classification (b) regression 

Figure 1: Loss funcitons 



additional optimization problem at each iteration. Compared to Nesterov's algorithms, the 
proposed algorithm is applicable to a large class of non-smooth optimization problems, is 
easier to implement, its convergence analysis is much simpler, and its empirical performance 
is usually better. 

3. Notations and Definitions 

In this section we provide the basic setup, some preliminary definitions and notations used 
throughout this paper. 

We denote by [n] the set of integers {l, -- ,n}. We denote by (xj,yj),i G [n] the 
training examples, where Xj G X C M rf and yi is the assigned class label, which is discrete 
for classification and continuous for regression. We assume ||xj||2 < 1, Vi G [n]. We 
denote by X = (xi,-- - ,x n ) T and y = (yi,--- ,y n ) T - Let w G M. d denote the linear 
hypothesis, £(w;x, y) denote a loss of prediction made by the hypothesis w on example 
(x, y), which is a convex function in terms of w. Examples of convex loss function are 
hinge loss £(w;x, y) = max(l — yw T x, 0), and absolute loss ^(w;x, y) = |w T x — y\. To 
characterize a function, we introduce the following definitions 

Definition 1 A function £(z) : Z — > M is a G-Lipschitz continuous if 

\e(zx) - £(z 2 )| < G|| Zl - z 2 || 2) Vzi,z 2 G Z. 

Definition 2 A function l(z) : Z — >■ R is a p-smooth function if its gradient is p-Lipschitz 
continuous 

||W(zi)- W(z 2 )|| 2 < P \\ Zl -z2H2.Vz1.Z2 G Z. 

A function is non-smooth if either its gradient is not well defined or its gradient is not 
Lipschtiz continuous. Examples of smooth loss functions are logistic loss ^(w; x, y) = 
log(l + exp(— yw T x)), square loss £(w;x, y) = |(w T x — y) 2 , and examples of non-smooth 
loss functions are hinge loss, and absolute loss. The difference between logistic loss and 
hinge loss, square loss and absolute loss can be seen in Figure 1. Examples of non-smooth 
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regularizer include i?(w) = ||w||i, i.e. i\ norm, R(w) = HwHoo, i.e. norm. More 
examples can be found in section 4.1. 

In this paper, we aim to solve the following optimization problem, which occurs in many 
machine learning problems, 

1 n 

min £(w) = - V *(w; Xi, Vi ) + XR(w), (1) 
weK d n f—f 

where £(w;x, y) is a non-smooth loss function, i?(w) is a non-smooth regularizer on w, and 
A is a regularization parameter. 

We denote by IIq[z] = argmin^||z — z||| the projection of z into domain Q, and by 

IIq 1 q 2 (5 1 ) the joint projection of zi and Z2 into domains Qi and Q2, respectively. Finally, 
' V z 2/ 

we use [s][o )S+ ] to denote the projection of s into [0, s+]. 

4. Pdprox: A Primal Dual Prox Method for Non-Smooth Optimization 

We first describe the non-smooth optimization problems that the proposed algorithm can 
be applied to, and then present the primal dual prox method for non-smooth optimization. 
We prove the convergence bound of the proposed algorithms and discuss several extensions. 
Proofs for technical lemmas are deferred to the appendix. 

4.1 Non-Smooth Optimization 

We first focus our analysis on linear classifiers and denote by w € M. d a linear model. The 
extension to nonlinear models is discussed in section 4.5. Also, extension to a collection of 
linear models W £ M. dxK can be done in a straightforward way. We consider the following 
general non-smooth optimization problem: 



mm 

WGQw 



£(w) = max L(w, a; X, y) + \R(w) 



(2) 



The parameters w in domain Q w and a. in domain Q a are referred to as primal and dual 
variables, respectively. Since it is impossible to develop an efficient first order method for 
general non-smooth optimization, we focus on the family of non-smooth loss functions that 
can be characterized by bilinear function L(w, a; X, y), i.e. 

L(w, a; X, y) = c (X, y) + a T a(X, y) + w T b(X, y) + w T H(X, y)a, (3) 

where co(X, y), a(X, y), b(X, y), and H(X, y) are the parameters depending on the training 
examples (X, y) with consistent sizes. In the sequel, we denote by L(w, a) = L(w, a; X, y) 
for simplicity, and by G w (w, a) = V w L(w, a) and G a (w,a) = X7 a L(w,a) the partial 
gradients of L(w, a) in terms of w and a, respectively. 

Remark 1 One direct consequence of assumption in (3) is that the partial gradient 
Cr w (w, a) is independent of w, and G a (w, a) is independent of cc, since L(w, a) is bilinear 
in w and ex. We will explicitly exploit this property in developing the efficient optimization 
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algorithms. We also note that no explicit assumption is made for the regularizer R(w). 
This is in contrast to the smoothing techniques used in (Nesterov, 2005a, b). 

To efficiently solve the optimization problem in (1), we need first turn it into the form (2). 
To this end, we assume that the loss function can be written into a dual form, which is 
bilinear in the primal and the dual variables, i.e. 



where /(w, a;x, y) is a bilinear function in w and a, and A Q is the domain of variable a. 
Using (4), we cast problem (1) into (2) with L(w,o:;X, y) given by 



with a = (ai, • • • , a n ) defined in the domain Q a = {a = (a±, • • • , a n ) , «j £ A a }. 

Before delving into the description of the proposed algorithms and their analysis, we 
give a few examples that shows many non-smooth loss functions can be written in the form 



^(w;xi,yj) 



a,eA a 



max /(w,ai;xj,yj) 



(4) 




(5) 



of (4): 



• Hinge loss (Vapnik, 1998): 



•£(w; x, y) = max(0, 1 — yw x) 



max ail — yw x 

Q6[0,l] 



• Generalized hinge loss (Bartlett and Wegkamp, 2008): 




where a > 1. 



• Absolute loss (Hastie et al., 2008): 




• e-insensitive loss (Rosasco et al., 2004) : 



^(w;x, y) 



max(|w x — y\ — e, 0) 



a l J r a 2 — 1 



max (w x — y)(oi — 02) — e(ai + 02) 



• Piecewise linear loss (Koenker, 2005): 




Q 1 +a 2 <l 



max a\a(y — w x) + 0:2(1 — a)(w x — y). 
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? 2 loss (Nie et al., 2010): 



^(W;x,y) = ||W T x-y|| 2 = max a T (W T x-y), 

IH|2<1 



where y S M. is multiple class label vector and W = (wi, • • • , wg-). 

Besides the non-smooth loss function £(\v;x,y), we also assume that the regularizer 
R(w) is a non-smooth function. Many non-smooth regularizers are used in machine learning 
problems. We list a few of them in the following, where W = (wi, • • • , wk), € W d and 
w- 7 is the jth row of W. 

• lasso: R(w) = ||w||i, £2 norm: i?(w) = ||w||2, and norm: R(w) = ((wH^. 

• group lasso: R(w) = J2f=i \fdg\\' w g\\2, where w g S W* 3 . 

• exclusive lasso: R(W) = Y^j=i II wJ 111- 

• £2,1 norm: R(W) = X^=i ll wJ 'l|2- 

. 4,oc norm: R{W) = Y? j= i \\™ j \\oo- 

• trace norm: i?(W) = ||W||i, the summation of singular values of W. 

• other regularizers: R(W) = {Y^ = i \\wkW2) ■ 



Note that unlike (Nesterov, 2005a, b), we do not further require the non-smooth regularizer 
to be written into a bilinear dual form, which could be violated by many non-smooth 

regularizers, e.g. R(W) = (^Yl^=i llwfclh) or more generally i?(w) = V(||w[|), where V(z) 
is a monotonically increasing function. 

We end this section by presenting a lemma showing an important property of the bilinear 
function L(w,a). 

Lemma 3 Let L(w,a) be bilinear in w and a. as in (3). Assuming there exists c > such 
that ||i?(X,y)||2 < c, then for any 0:1,0:2 G Qa, and Wi,W2 G Q w we have 

||G a (wi,a 1 ) - G Q (w2,o 2 )||2 < c ll w i - w 2 |||, (6) 
||G w (wi,oi) - G w (w 2 ,o:2)||2 < c||«i - 0:2 Hi- (7) 

Remark 2 The value of constant c in Lemma 3 is an input to our algorithms used to 
set the step size. In the Appendix A, we show how to estimate constant c for certain loss 
functions. In addition the constant c in bounds (6) and (7) do not have to be the same as 
shown by the the example of generalized hinge loss in Appendix A. 
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Algorithm 1 The Pdprox-dual Algorithm for Non-Smooth Optimization 

1: Input: step size 7 = yl/(2c), where c is specified in (6). 

2: Initialization: wo = 0, (3 = 

3: for t = 1, 2, . . . do 

4: Q t = U Qa [/3 f _ ! + 7<3 a (W t _ 1 , /3 t _ ! )] 

5: Wt = argmin wgSw \ ||w - (w t _i - 7G w (wt_i, a t ))\\ 2 + 7Ai?(w) 
6: Pt = n Q a [A-i + ~fG a {w t , a t )) 

7: end for 

8: Output w-r = Xlt=i W t/T and = ]C^=i a t/T. 



4.2 The Proposed Primal-Dual Prox Methods 

In this subsection, we present two variants of Primal Dual Prox (Pdprox) method for solving 
the non-smooth optimization problem in (2). The common feature shared by the two 
algorithms is that they update both the primal and the dual variables at each iteration. In 
contrast, most first order methods only update the primal variables. The key advantages 
of the proposed algorithms is that they are able to capture the sparsity structures of 
both primal and dual variables, which is usually the case when both the regularizer and 
the loss functions are both non-smooth. The two algorithms differ from each other in the 
number of copies for the dual or the primal variables, and the specific order for updating 
those. Although our analysis shows that the two algorithms share the same convergence 
rate; however, our empirical studies show that the second algorithm is more advantageous 
when the number of training examples is significantly larger than the number of features 
(i.e. n>d). 

Pdprox-dual algorithm Algorithm 1 shows the first primal dual prox algorithm for op- 
timizing the problem in (2). Compared to the other gradient based algorithms, Algorithm 1 
has several interesting features: 

(i) it updates both the dual variable a and the primal variable w. This is useful when 
additional constraints are introduced for the dual variables, as we will discuss later. 

(ii) it introduces an auxiliary dual variable (3 in addition to a, and updates both a and (3 
at each iteration by a gradient mapping. The gradient mapping on the dual variables 
into a sparse domain allows the proposed algorithm to capture the sparsity of the dual 
variables (more discussion on how the sparse constraint on the dual variable affects the 
convergence is presented in Section 4.5). Compared to the second algorithm presented 
below, we refer to Algorithm 1 as Pdprox-dual algorithm since it introduces an 
auxiliary dual variable in updating. 

(iii) the primal variable w is updated by a composite gradient mapping (Nesterov, 2007) 
in step 5. Solving a composite gradient mapping in this step allows the proposed algo- 
rithm to capture the sparsity of the primal variable. Similar to many other approaches 
for composite optimization (Duchi and Singer, 2009; Hu et al., 2009), we assume that 
the mapping in step 5 can be solved efficiently. (This is the only assumption we made 
on the non-smooth regularizer. The discussion in Section 4.4 shows that the proposed 
algorithm can be applied to a large family of non-smooth regularizers). 
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Algorithm 2 The Pdprox-primal Algorithm for Non-Smooth Optimization 

1: Input: step size 7 = yl/(2c), where c is specified in (7). 

2: Initialization: uo = 0, ao = 

3: for t = 1, 2, . . . do 

4: w t = argmin w6Sw \ ||w - (u t _i - 7G w (u t _i, a t -i)) \\\ + 7Ai?(w) 

5: at = n Qa [a t _i + 7G , a (w t , a t _i)] 

6: u t = w t + 7(G w (u t _i,a t _i) - G w (w t ,a t )) 

7: end for 

8: Output w-r = X^t=i W t/T and Sx = X/t=i a t/T. 



(iv) the step size 7 is fixed to y / l/(2c), where c is the constant specified in Lemma 3. 
This is in contrast to most gradient based methods where the step size depends on T 
and/or A. This feature is particularly useful in implementation as we often observe 
that the performance of a gradient method is sensitive to the choice of the step size. 

Pdprox-primal algorithm In Algorithm 1, we maintain two copies of the dual variables 
a and /3, and update them by two gradient mappings. We can actually save one gradient 
mapping on the dual variable by first updating the primal variable w t , and then updating a t 
using partial gradient computed with W(. As a tradeoff, we add an auxiliary primal variable 
u, and update it by a simple calculation. The detailed steps are shown in Algorithm 2. 
Similar to Algorithm 1, Algorithm 2 also needs to compute two partial gradients (except 
the initial partial gradient on the primal variable), i.e. G w (-,at) and G a (w(, •). Different 
from Algorithm 1, Algorithm 2 (i) maintains (wt,at, u t) at each iteration with 0(2d + n) 
memory, while Algorithm 1 maintains (ott, Wt,/3 t ) at each iteration with 0(2n + d) memory; 
(ii) and replaces one gradient mapping on an auxiliary dual variable (3 t with a simple update 
on an auxiliary primal variable Uj. When the number of examples n is much larger than 
the dimension d, i.e. n S> d, Algorithm 2 could save the memory and the computational 
cost. However, the convergence rate of Algorithm 2 is the same to Algorithm 1. Because 
it introduces an auxiliary primal variable, we refer to Algorithm 2 as the Pdprox-primal 
algorithm. 

It should be noted that although Algorithm 1 uses a similar strategy for updating the 
dual variables a and (3, but it is significantly different from the mirror prox method (Nemirovski, 
2005). First, unlike the mirror prox method that introduces an auxiliary variable for w, 
Algorithm 1 introduces a composite gradient mapping for updating w. Second, Algorithm 1 
updates w; using the partial gradient computed from the updated dual variable ott rather 
than f3 t _i- Third, Algorithm 1 does not assume that the overall objective function has 
Lipschitz continuous gradients, a key assumption that limits the application of the mirror 
prox method. 

4.3 Convergence Analysis 

This section establishes bounds on the convergence rate of the proposed algorithms. We 
begin by presenting a theorem about the convergence rate of Algorithms 1 and 2. For ease 
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of analysis, we first write (2) into the following equivalent minimax formulation 

min max F(w, a) = L(w, a) + Ai?(w). (8) 

Our main result is stated in the following theorem: 

Theorem 4 By running Algorithm 1 or Algorithm 2 with T steps, we have 



F(wt, a) — F(-w, olt) < 



I w ll2 + |«| 2 



for any w £ Q w and a £ Q a . In particular, 



, „r*l|2 | ll^,*l|2 

max F(wt,oc) — min .F(w, S^) < 



w *ll2 + II" 112 



«eQ a weSw A /(2/c)T 

where (w*,a*) is i/ie optimal solution to max a6 g a i^w^a) — min wg g w F(w,6tT)- 

In order to aid understanding, we present the proof of Theorem 4 for each algorithm 
seperately in the following subsections. 

4.3.1 Convergence Analysis of Algorithm 1 

For the simplicity of analysis, we assume Q w = M d is the whole Euclidean space. We discuss 
how to generalize the analysis to a convex domain Q w in Section 4.5. In order to prove 
Theorem 4 for Algorithm 1, we present a series of lemmas to pave the path for the proof. 
We first restate the key updates in Algorithm 1 as follows: 

at = n Sa [A-i+7G ! a(w t _ 1 ,/3 t _ 1 )] , (9) 
w t = arg min -||w - (w t _i - jG w (wt-i,a t ))\\l + 7Ai?(w), (10) 



/3t = n ett [/3 t _ 1 + 7 Ga(w tj a t )]. (11) 
Lemma 5 The updates in Algorithm 1 are equivalent to the following gradient mappings, 

^J= n e«,R<H £t-i + 7G„(u t _i,/3 t _ 1 ) 

\ut-i -7(G w (u t _i,a t ) + Avt), 

and 

'A 
u, 



^u t _i - 7(G w (wt, a t ) + Avf) y 



with initialization Uo = Wo, where Vf £ dR(wt) is a partial gradient of the regularizer on 
w t . 
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Proof First, we argue that there exists a fixed (sub)gradient v t £ dR(wt) such that the 
composite gradient mapping (10) is equivalent to the following gradient mapping, 

Wt = n Rd [w t _i - 7 (G w (w t _i, a t ) + Av t )] . (12) 

To see this, since w; is the optimal solution to (10), by first order optimality condition, 
there exists a subgradient Vj = dR(wt) such that W( — W(_i + 7G w (w<_i, at) + 7Avt = 0, 
i.e. 

w ( = w t _i - 7G w (w t _i, OL t ) ~ 7^v t , 

which is equivalent to (12) since the projection IT K d is an identical mapping. 

Second, the updates in Algorithm 1 for (a,/3, w) are equivalent to the following updates 
for (a, (3, w, u) 

a t = n Q a [Pt-i + l G oc{^t-i,Pt-\)\ , 

w t = n M d [u t _i - 7 (G w (u t _i,at) + Av t )] , (13) 
Pt = n S a [Pt-i + 7G a (w t , a t )] , 

u t = w t - 7(G w (w t , a t ) - G w (ut_i, at)), (14) 

with initialization uo = wo- The reason is because Uf = wt, t = 1, • • • due to G w (wt, at) = 
G w (ut_i, at), where we use the fact that L(w, a) is linear in w. 

Finally, by plugging (13) for Wf into the update for Ut in (14), we complete the proof of 
Lemma 5. ■ 



The reason that we translate the updates for (at, w;, j3 t ) in Algorithm 1 into the updates 
for (at, wt, f3 t , Uf) in Lemma 5 is because it allows us to fit the updates for (at, Wt, Pt, Uf) 
into Lemma 12 as presented in Appendix C, which leads us to a key inequality as stated in 
Lemma 6 to prove Theorem 4. 

Lemma 6 For all t = 1, 2, • • • , and any w 6 M. d , a € Q a , we have 



G w (-w t ,ctt) + \vt\ ^w t -w^ < l 



-G a (w t ,a t ) 



at — a 



W — Uf_i 

a - /3 t _i 



w — Ut 
a-/3t 
1, 



+ 7 ||G a (wt,at) - Ga^t-i,^!)^ - -||wt - ut_i|| 2 . 



The proof of Lemma 6 is deferred to Appendix C. We are now ready to prove the main 
theorem for Algorithm 1. 

Proof [of Theorem 4 for Algorithm 1] Since F(w,a) is convex in w and concave in a, we 
have 



F(w t , a t ) - F(w, a t ) < (G w (w t , a t ) + Avj) T (w t - w), 
F(w t , a) - F(wt, at) < -G a (w t , a t ) T (a t - a), 
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where Vt £ dR{wt) is the partial gradient of R(w) on w< stated in Lemma 5. Combining 
the above inequalities with Lemma 6, we have 



7 (F(w t , a t ) - F(w, a. t ) + F(w t , a) - F(w t , a t )) 



< 


1 


/ w 






2 




-A-i 


< 


1 


( w 


- u t _i 




2 


\ a 


-A-i 


< 


1 


( w 


- u t _i 




2 


[a 


-J3t-i 



w - u t 
a- A 

w - u t 

a- (3 t 

w - u t 

a-(3 t 



+ j 2 \\G a {w t ,a t ) - G a (u t _ 1 ,/3 t _ 1 )||^ - -\\w t - u t _i||| 



+ 7 2 c||w t - u t _i[|| 



W( - u ( _i ; 



where the second inequality follows the inequality (6) in Lemma 3 and the fact 7 = -y/l/(2c). 
By adding the inequalities of all iterations, we have 



kill + Hi! 



(15) 



We complete the proof by using the definitions of wy, Sy, and the convexity-concavity of 
F(w, a) with respect to w and a, respectively. ■ 



4.3.2 Convergence Analysis of Algorithm 2 



We can prove the convergence bound for Algorithm 2 by following the same path. In the 
following we present the key lemmas similar to Lemmas 5 and 6, with proofs omitted. 

Lemma 7 There exists a fixed partial gradient vj 6 <9i?(wj) such that the updates in Algo- 
rithm 2 are equivalent to the following gradient mappings, 



and 



w, 



IL 



u t _i - 7(G w (u t _i,/3 t _ 1 ) + \v t ] 
Pt-i +7G a (w t ,/3 i _ 1 ) 



u f _i - 7(G w (w t , a t ) + Xv t 
Pt-i +jG a (-w t ,cxt) 



with initialization (3 = cxq. 

Lemma 8 For all t = 1, 2, • • • , and any w S 



7 



G w (w t ,a t ) + Avt 
-G a (w t ,a t ) 



W( — w 
ctj — a 



< 



1 



2 



W - Uf_i 

a - Pt-i 



+ 7 2 ||G w (w t ,a t ) - G w (u t _i, /9 t _i)| 



w - u t 
1 



K - A-il 
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Proof [of Theorem 4 for Algorithm 2] Similar to proof of Theorem 4 for Algorithm 1, we 
have 



7 (F(w t , OL t ) - F(w, a. t ) + F(w t , a) - F(w t , a t )) 



< 


1 


/ w 


- u t _i 




2 




-Pt-i 


< 


1 


( w 


- u t _i 




2 


\a 


-Pt-l 


< 


1 


( w 


- u t _i 




2 


\a 





w - u t 
a-/3 t 

w - u t 
a - A 
w - u t 
a- A 



+ 7 2 ||G w (w t ,Q! t ) - G w (u t -i, /3 t _x)\\l - -\\at t - fit 
+ 7 2 c||o!t - Pt-i\\l ~ oll a * ~ Pt-iWi 



where the last step follows the inequality (7) in Lemma 3 and the fact 7 = yl/(2c). By 
adding the inequalities of all iterations, we have 



^2(F(w u a)-F(w,cx t )) < 



[win + ii « in 



t=i 



VWcjT 



(16) 



We complete the proof by using the definitions of wy, S7 1 . and the convexity-concavity of 
F(w, a) with respect to w and a, respectively. ■ 

Finally, we state the convergence bound for the objective £(w) in (2) in the following 
corrolary. 

Corrolary 9 Let w* be the optimal solution to (2), bounded by 1 1 w* 1 1 2 < D\, and ||ck||| < 
D2, Va £ Q a . 5y running Algorithm 1 or 2 with T iterations, we have 



£(w T ) -£(w*) < 



D 1 + D 2 



Proof Let w = w* = argmin wg g w £(w) and a* = argmaxQ, g Q a F(wt, ck) in Theorem 4, 
then we have 



r*l|2 I ||«,*||2 



, * ^ . w 2 + 112 

max F wt, ct) — f< (w , ckt) < , ■ 

«es a V ' V 11 ~ ^/{2/c)T ' 



Since £(w) = max F(yv, a) > F(w, olt), then we have 



£(w T ) - £(w*) < 



D1 + D2 
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Comparison with existing algorithms We compare the proposed algorithm to the Pe- 
gasos algorithm (Shalev-Shwartz et al., 2007) for minimizing the £\ regularized hinge loss. 
Although in this case both algorithms achieve a convergence rate of 0(1/T), their depen- 
dence on the regularization parameter A is very different. In particular, the convergence 
rate of the proposed algorithm is 0((1 + n\) /[y2n\T]) by noting that 1 1 w* 1 1 2 = 0(1/A), 
He** Hi — \\ a * 111 — n i an d c = 1/n, while the Pegasos algorithm has a convergence rate 
of 0(1/ [AT]), where we ignore the logarithmic term fn(T). According to the common as- 
sumption of learning theory (Wu and Zhou, 2005; Smale and Zhou, 2003), the optimal A is 
0( n -i/(T+i)) if the probability measure can be approximated by the closure of RKHS 7~Lk 
with exponent < r < 1. As a result, the convergence rate of the proposed algorithm is 
0{-yJn/T) while the convergence rate of Pegasos is 

0( n V(i-Hr)/T). Since r G (0,1], the pro- 
posed algorithm is more efficient than the Pegasos algorithm, particulaly when the number 
of training examples n is very large. This is verified by our empirical studies in section 5.6 
(see Figure 8). 

4.4 Implementation 

We briefly discuss how to efficiently solve the optimization problems for updating the primal 
and dual variables in Algorithms 1 and 2. Both a and (3 are updated by a gradient mapping 
that requires computing the projection into the domain Q a . When Q a is only consisted of 
box constraints (e.g. hinge loss, absolute loss, and e-insentive loss), the projection JIqcI"] 
can be computed by thresholding. When Q a is comprised of both box constraints and 
a linear constraint (e.g. generalized hinge loss), the following lemma gives an efficient 
algorithm for computing Y\.Q \P*\- 

Lemma 10 For Q a = {a : a € [0,s] n ,o: T v < p}, the optimal solution a* to projection 
\\qJ\Ol] is computed by 

a* = [Si - 77Ui][o, s ],Vi G [n], 
where r/ = if Y2i[®-i]{o,s] v i — P an d otherwise is the solution to the following equation 

- T}Vi][0,s]Vi ~ P = 0. (17) 

i 

Since ^J^« — V v i][0,s] v i — 1 *s monotonically decreasing in rj, we can solve rj in (17) by a 
bisection search. 

For the composite gradient mapping for w G Q w = M. d , there is a closed form solution 
for simple regularizers (e.g., and decomposable regularizers (e.g., ^1,2)- Efficient 

algorithms are available for composite gradient mapping when the regularizer is the £00 
and £i )0 oj ° r trace norm. More details can be found in (Duchi and Singer, 2009; Ji and Ye, 
2009). Here we present an efficient solution to a general regularizer V(||w||), where ||w|| 
is either a simple regularizer (e.g., £±, £2, and ^oo) or a decomposable regularizer (e.g., £±^ 
and £ioo), and V(z) is convex and monotonically increasing for z > 0. An example is 
V(wi, • • • , Wk) = ||wfc||2) 2 , where wi, . . . , w^- forms a partition of w. 
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Lemma 11 Let V*(rj) be the convex conjugate ofV(z), i.e. V{z) = max^ rjz — 14 (77) . Then 
the solution to the composite mapping 



can be computed by 



w* = arg min — ||w — w||2 + AV(||w||), 
weQw 2 



w* = arg min — ||w — w||| + A77H w|| , 
weQw 2 



where r] satisfies ||w*||— Vj.(ji) = 0. Since both ||w*|| and — V^(n) are non-increasing functions 
in r\, we can efficiently compute n by a bisection search. 

4.5 Extensions and Discussion 

Nonlinear model For a nonlinear model, the min-max formulation becomes 

min max L(g, ct;X, y) +XR(g), 

g£H K a£Q a 

where T~L K is a Reproducing Kernel Hilbert Space (RKHS) endowed with a kernel function 
«(•, •). Algorithm 1 can be applied to obtain the nonlinear model by changing the primal 
variable to g. For example, step 5 in Algorithm 1 is modified to the following composite 
gradient mapping 

g t = arg min - \\g - gt-l\\n + 7Ai?(#), 
g&i K 2 

where 

9t-i = {9t-i ~ 7V fl -£f(<fe-i,at;X,y)) . 
Similar changes can be made to Algorithm 2 for the extension to the nonlinear model. 

Incorporating the bias term Algorithms 1 or 2 can be modified to incorporate the bias 
term b by viewing b as an additional primal variable. For example, in Algorithm 1, at each 
iteration, we can update b by one prox mapping b t = Wq [£>t_i+7V&£(wt_i,&$_i,a!t;X,y)], 

where Qb is an appropriate domain for b. The step size 7 is changed to 7 = y/ (l/c)/2 when 
the bias term is considered, and the convergence rate is amplified by a constant y/2. 

Domain constraint on primal variable Now we discuss how to generalize the con- 
vergence analysis to the case when a convex domain Q w is imposed on w. We introduce 
.R(w) = Ai2(w) + Q(w), where Q(w) is an indicator function for w G Q w , i.e. 



Q(w) 



w e Q w 

+00 otherwise 



Then we can write the domain constrained composite gradient mapping in step 5 of Al- 
gorithm 1 or step 4 of Algorithm 2 into a domain free composite gradient mapping as the 
following: 

w t = arg min -||w - (w t -i - ^G w (w t -i,a t ))\\l +7-R(w), 
weR d 2 

w t = arg min -||w - (u t _i - 7G w (u t _i,a t _i))||| + 7^(w). 

wGK d 2 
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Then we have an equivalent gradient mapping, 

wt = w t _i - 7G w (w t _i, a t ) - jdR(w t ), 
w t = u t _i - 7G w (u t _i,a t _i) - 7&R(w t ). 

Then Lemmas 5 and 6, and Lemmas 7 and 8 all hold as long as we replace Avj with 
Vt £ dR(wt). Finally in proving Theorems 4 we can absorb Q(w) in L(w, at) + R(w) into 
the domain constraint. 

Additional constraints on dual variables One advantage of the proposed primal dual 
prox method is that it provides a convenient way to handle additional constraints on the 
dual variables a. Several studies introduce additional constraints on the dual variables. 
In (Dekel and Singer, 2006), the authors address a budget SVM problem by introducing 
a 1 — oo interpolation norm on the empirical hinge loss, leading to a sparsity constraint 
\\ct\\i < m on the dual variables, where m is the target number of support vectors. The 
corresponding optimization problem is given by 

1 ™ 

min max — a;(l — %w T Xj) + XR(w). (18) 
weR d cte[o,l],||t*||i<m n ^—f 

In (Huang et al., 2010), a similar idea is applied to learn a distance metric from noisy 
training examples. We can directly apply Algorithms 1 or 2 to (18) with Q a given by Q a = 
{a:a:£[0, l] n ,||a:||i< m}. The prox mapping to this domain can be efficiently computed 
by Lemma 10. It is straightforward to show that the convergence rate is [D± + m]/[y/2nT] 
in this case. 

5. Experiments 

In this section we present empirical studies to verify the efficiency of the proposed algorithm. 
We organize our experiments as follows. 

• In subsections 5.1, 5.2, and 5.3 we compare the proposed algorithm to the state-of- 
the-art first order methods that directly update the primal variable at each itera- 
tion. We apply all the algorithms to three different tasks with different non-smooth 
loss functions and regularizers. The baseline first order methods used in this study 
include the gradient descent algorithm (gd), the forward and backward splitting al- 
gorithm (fobos) (Duchi and Singer, 2009), the regularized dual averaging algorithm 
(rda) (Xiao, 2009), the accelerated gradient descent algorithm (agd) (Chen et al., 
2009). Since the proposed method is a non-stochastic method, we compare it to the 
non-stochastic variant of gd, fobos, and rda. Note that gd, fobos, rda, and agd 
share the same convergence rate of 0(1/ y/T) for non-smooth problems. 

• In subsection 5.4, our algorithm is compared to the state-of-the-art primal dual gra- 
dient method (Nesterov, 2005b), which employs an excessive gap technique for non- 
smooth optimization, updates both the primal and dual variables at each iteration, 
and has a convergence rate of 0(1 /T). 
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• In subsection 5.5, we test the proposed algorithm for optimizing problem in (18) with 
a sparsity constraint on the dual variables. 

• In subsection 5.6, we compare the two variants of the proposed method on a data set 
when and compare Pdprox to Pegasos algorithm. 

All the algorithms are implemented in Matlab and run on a 2.4GHZ machine. Since 
the performance of the baseline algorithms gd, fobos and rda depends heavily on the 
initial value of the stepsize, we generate 21 values for the initial stepsize by scaling their 
theoretically optimal values with factors 2[ _10:1:10 1, and report the best convergence among 
the 21 possible values. The stepsize of agd is changed adaptively in the optimization 
process, and we just give it an appropriate initial step size. Since in the first four subsections 
we focus on comparison to baselines, we use the Pdprox-dual variant of the proposed Pdprox 
algorithm. Finally, all algorithms are initialized with a solution of all zeros. 

5.1 Group lasso regularizer for Grouped Feature Selection 

In this experiment we use the group lasso for regularization, i.e., R(w) = J2 g \/^9ll w 9ll 2 > 
where w g corresponds to the gth group variables and d g is the number of variables in 
group g. To apply Nesterov's method, we can write R(w) = max|| U9 || 9<1 ^ y^wju s . 
We use the MEMset Donar dataset (Yeo and Burge, 2003) as the testbed. This dataset 
was originally used for splice site detection. It is divided into a training set and a test set: 
the training set consists of 8, 415 true and 179, 438 false donor sites, and the testing set 
has 4, 208 true and 89, 717 false donor sites. Each example in this dataset was originally 
described by a sequence of {A, C, G, T} of length 7. We follow (Yang et al., 2010) and 
generate group features with up to three-way interactions between the 7 positions, leading 
to 2,604 attributes in 63 groups. We normalize the length of each example to 1. Following 
the experimental setup in (Yang et al., 2010), we construct a balanced training dataset 
consisted of all 8, 415 true and 8, 415 false donor sites that are randomly sampled from all 
179, 438 false sites. 

Two non-smooth loss functions are examined in this experiment: hinge loss and absolute 
loss. Figure 2 plots the values of the objective function vs. running time (second), using 
two different values of regularization parameter, i.e., A = 10~ 3 ,10 -5 to produce different 
levels of sparsity. We observe that (i) the proposed algorithm Pdprox clearly outperforms 
all the baseline algorithms in all the cases; (ii) for the absolute loss, which has a sharp 
curvature change at zero compared to hinge loss, the baseline algorithms of gd, fobos, 
rda, agd, especially of agd that is originally designed for smooth loss functions, deteriorate 
significantly compared to the proposed algorithm Pdprox. Finally, we observe that for the 
hinge loss and A = 10 -3 , the classification performance of the proposed algorithm on the 
testing dataset is 0.6565, measured by maximum correlation coefficient (Yeo and Burge, 
2003). This is almost identical to the best result reported in (Yang et al., 2010) (i.e., 
0.6520). 

5.2 £i >O0 regularization for Mult i- Task Learning 

In this experiment, we perform multi-task regression with £i j00 regularizer (Chen et al., 
2009). Let W = (wi, • • • , w/u) G ]R rfxfc denote the k linear hypotheses for regression. The 
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Figure 2: Comparison of convergence speed for hinge loss ((a),(b)) and absolute loss 
((c),(d)) with group lasso regularizer. Note that for better visualization we plot 
the objective starting from 10 seconds in all figures. The objective of all algo- 
rithms at second is 1. 



^1,00 regularizer is given by R(W) = Y2j=i ll wJ 'Hooj where w- 5 is the jth row of W. To apply 
Nesterov's method, we rewrite the £i j00 regularizer as R(W) = max|| Uj .|| 1<1 X^i=i u j T w J • 
We use the School data set (Argyriou et al., 2008), a common dataset for multi-task learn- 
ing. This data set contains the examination scores of 15, 362 students from 139 secondary 
schools corresponding to 139 tasks, one for each school. Each student in this dataset is 
decribed by 27 attributes. We follow the setup in (Argyriou et al., 2008), and generate a 
training data set with 75% of the examples from each school and a testing data set with 
the remaining examples. We test the algorithms using both the absolute loss and the e- 
insensitive loss with e = 0.01. The initial stepsize for gd, fobos, and rda are tuned similarly 
as that for the experiment of group lasso. We plot the objective versus the running time in 
Figure 3, from which we observe the similar results in the group feature selection task, i.e. 
(i) the proposed Pdprox algorithm outperforms the baseline algorithms, (ii) the baseline 
algorithm of agd becomes even worse for e-insensitive loss than for absolute loss. Finally, 
we observe that the regression performance measured by root mean square error (RMSE) on 
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Figure 3: Comparison of convergence speed for absolute loss ((a),(b)) and e-insensitive loss 
((c), (d)) with £i jOC regularizer. Note that for better visualization we plot the 
objective starting from 10 seconds in all figures. The objective of all algorithms 
at second is 20.52. 



the testing data set for absolute loss and e-insensitive loss is 10.34 (optimized by Pdprox), 
comparable to the performance reported in (Chen et al., 2009). 



5.3 Trace norm regular izat ion for Max- Margin Matrix Factorization/ Matrix 
Completion 

In this experiment, we evaluate the proposed method using trace norm regularization, a 
regularizer often used in max-margin matrix factorization and matrix completion, where 
the goal is to recover a full matrix X from partially observed matrix Y. The objective is 
composed of a loss function measuring the difference between X and Y on the observed en- 
tries and a trace norm regularizer on X, assuming that X is low rank. Hinge loss function is 
used in max-margin matrix factorization (Rennie and Srebro, 2005; Srebro et al., 2005), and 
absolute loss is used instead of square loss in matrix completion. We test on 100K Movie- 
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Figure 4: Comparison of convergence speed for (a,b): max-margin matrix factorization with 
hinge loss and trace norm regularizer, and (c,d): matrix completion with absolute 
loss and trace norm regularizer. 



Lens data set that contains 1 million ratings from 943 users on 1682 movies. Since there 
are five distinct ratings that can be assigned to each movie, we follow (Rennie and Srebro, 
2005; Srebro et al., 2005) by introducing four thresholds #1,2,3,4 to measure the hinge loss 
between the predicted value Xij and the ground truth Yij. Because our goal is to demon- 
strate the efficiency of the proposed algorithm for non-smooth optimization, therefore we 
simply set #1,2,3,4 = (0,3,6,9). Note that we did not compare to the optimization algo- 
rithm in (Rennie and Srebro, 2005) since it cast the problem into a non-convex problem 
by using explicit factorization of X = UV T , which suffers a local minimum, and the op- 
timization algorithm in (Srebro et al., 2005) since it formulated the problem into a SDP 
problem, which suffers from a high computational cost. To apply Nesterov's method, we 
write ||X||i = maxii_4||<i tr(A T X), and at each iteration we need to solve a maximization 
problem max||A|j<i Atr(A T X) — ^||A|||i, where ||A|| is the spectral norm on A. The solution 
of this optimization is obtained by by performing SVD decomposition of X and thresholding 
the singular values appropriately. Since MoiveLens data set is much larger than the data 

1. http: //www . cs .umn.edu/Research/GroupLens/ 
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sets used in last two subsections, in this experiment, we (i) run all the algorithms 1000 
iterations and plot the objective versus time; (ii) enlarge the range of tuning parameters 
to 2l- 15:1:15 l The results are shown in Figure 4, from which we observe that (i) Pdprox 
can quickly reduce the objective in a small amount of time, e.g., for absolute loss when 
setting A = 10 -3 in order to obtain a solution with an accuracy of 10 -3 , Pdprox needs 
10 3 second, while agd needs 3.2 x 10 4 seconds; (ii) for absolute loss no matter how we 
tune the stepsizes for each baseline algorithm, Pdprox performs the best; and (hi) for hinge 
loss when A = 10~ 5 , by tuning the stepsizes of baseline algorithms, gd, fobos, and rda 
can achieve comparable performance to Pdprox. We note that although agd can achieve 
smaller objective value than Pdprox at the end of 1000 iterations, however, the objective 
value is reduced slowly. 

5.4 Comparison: Pdprox vs Primal-Dual method with excessive gap technique 

In this section, we compare the proposed primal dual prox method to Nesterov's primal 
dual method (Nesterov, 2005b), which is an improvement of his algorithm in (Nesterov, 
2005a). The algorithm in (Nesterov, 2005a) for non-smooth optimization suffers a problem 
of setting the value of smoothing parameter that requires the number of iterations to be 
fixed in advance. Nesterov (2005b) addresses the problem by exploring an excessive gap 
technique and updating both the primal and dual variables, which is similar to the proposed 
Pdprox method. We refer to this baseline as Pdexg. We run both algorithms on the three 
tasks as in subsections 5.1, 5.2, and 5.3, i.e., group feature selection with hinge loss and 
group lasso regularizer on MEMset Donar data set, multi-task learning with e-insensitive 
loss and £i iDO regularizer on School data set, and matrix completion with absolute loss and 
trace norm regularizer on 100K MoiveLens data set. To implement the primal dual method 
with excessive gap technique, we need to intentionally add a domain on the optimal primal 
variable, which can be derived from the formulation. For example, in group feature selection 
problem whose objective is l/w^" =1 £(w T Xj, yi) + A y^dgW^glU, we can derive that the 
optimal primal variable w* lies in 1 1 w 1 1 2 < Sollwglb < /\ . , where d m - m = mm„d„. 
Similar techniques are applied to multi-task learning and matrix completion. 

The performance of the two algorithms on the three tasks is shown in Figure 5. Since 
both algorithms are in the same category, i.e. updating both primal and dual variables 
at each iteration and having a convergence rate in the order of 0(1/T), we also plot the 
objective versus the number of iterations in the bottom panels of each subfigure in Figure 5. 
The results show that the proposed Pdprox method converges faster than Pdexg on MEMset 
Donar data set for group feature selection with hinge loss and group lasso regularizer, and 
on 100K MovieLens data set for matrix completion with absolute loss and trace norm 
regularizer. However, Pdexg performs better on School data set for multi-task learning 
with e-insensitive loss and ^i iOC regularizer. One interesting phenomenon we can observe 
from Figure 5 is that for larger values of A (e.g., 10 -3 ), the improvement of Pdprox over 
Pdexg is also larger. The reason is because that the proposed Pdprox captures the sparsity 
of primal variable at each iteration. This does not hold for Pdexg because it casts the 
non-smooth regularizer into a dual form and consequently does not explore the sparsity of 
the primal variable at each iteration. Therefore the larger of A, the sparser of the primal 
variable at each iteration in Pdprox that yields to larger improvement over Pdexg. For the 
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Figure 5: Pdprox vs Primal-Dual method with excessive gap technique. 



example of group feature selection task with hinge loss and group lasso regularizer, when 
setting A = 10~ 3 , the sparsity of the primal variable (i.e., the proportion of the number of 
group features with zero norm) in Pdprox averaged over all iterations is 0.7886. However, by 
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Table 1: Running time (forth column) and classification accuracy (fifth column) of Pdprox 
for (18) and of Liblinear on noisily labeled training data, where noise is added to 
labels by random flipping with a probability 0.2. We fix A = 1/n or C = 1 in 
Liblinear. In the second column, we report the number of training examples (re), 
the number of attributes (d), and also the accuracy by training Liblinear on the 
original data and evaluating it on the testing data. 



Data Set 


(n,d)/ACC 


Alg. 




Running time 


ACC 


a9a 


(32561, 123) 
0.8501 


Pdprox(m= 
Liblinear 


=200) 


0.82s(0.013) 
1.15s(0.5688) 


0.8344(0.0005) 
0.7890(0.0044) 


rcvl 


(20242, 47236) 
0.9654 


Pdprox(m= 
Liblinear 


=200) 


1.57s(0.2328) 
3.30s(0.7399) 


0.9405(0.0022) 
0.9366(0.0015) 


covtype 


(571012, 54) 
0.7580 


Pdprox(m= 
Liblinear 


=4000) 


48s(3.34) 
37s(0.64) 


0.7358(10" 4 ) 
0.6866(10- 5 ) 



reducing A to 10 -5 the average sparsity of the primal variable in Pdprox is reduced to 0. In 
both settings the average sparsity of the primal variable in Pdexg is 0. The same argument 
also explains why Pdprox does not perform as well as Pdexg on School data set when setting 
A = 10 -5 , since in this case the primal variables in both algorithms are not sparse. When 
setting A = 1(R 3 , the average sparsity (i.e., the proportion of the number of features with 
zero norm across all tasks) of the primal variable in Pdprox and Pdexg is 0.3766 and 0, 
respectively. Finally, we also observe similar performance of the two algorithms on the three 
tasks with other loss functions including absolute loss for group feature selection, absolute 
loss for multi-task learning, and hinge loss for max-margin matrix factorization. 

5.5 Sparsity constraint on the dual variables 

In this subsection, we examine empirically the proposed algorithm for optimizing the prob- 
lem in equation (18), in which a sparsity constraint is introduced for the dual variables. 
We test the algorithm on three large data sets from the UCI repository, namely, a9a, 
rcvl(binary) and covtye 2 . In the experiments we use £| regularizer and fix A = 1/n. First, 
we run the proposed algorithm 100 seconds on the three data sets with different values of 
m = 100, 200, 400 and plot the objective versus the number of iterations. The results are 
shown in Figure 6, which verify our theoretical convergence bound (i.e., 0([D -\-m]/[\^2n\]) 
of the proposed algorithm for (18), i.e., the smaller the m, the faster the convergence. 

Second, we demonstrate that the formulation in equation (18) with a sparsity constraint 
on the dual variables is useful in the case when labels are contaminated with noise. To 
generate the noise in labels, we randomly flip the labels with a probability 0.2. We run 
both the proposed algorithm for (18) and Liblinear 3 on the training data with noise added 
to the labels. The stopping criterion for the proposed algorithm is when duality gap is less 
than 10 -3 , and for Liblinear is when the maximal dual violation is less than 10 -3 . The 

2. http : //www. csie .ntu. edu. tw/~ cjlin/libsvmtools/datasets 

3. http : / /www . csie . ntu . edu . tw/~ c j lin/liblinear 
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Figure 7: Comparison of Pdprox-dual vs. Pdprox-primal on covtype data set. 



running time and accuracy on testing data averaged over 5 random trials are reported in 
Table 1, which demonstrate that in the presence of noise in labels, by adding a sparsity 
constraint on the dual variables, we are able to obtain better performance than Liblinear 
trained on the noisily labeled data. Furthermore the running time of Pdprox is comparable 
to, if not less than, that of Liblinear. 

Finally, we note that choosing a small m in equation (18) is different from simply 
training a classifier with a small number of examples. For instance, for rcvl, we have run 
the experiment with 200 training examples, randomly selected from the entire data set. 
With the same stopping criterion, the testing performance is 0.8131 (±0.05), significantly 
lower than that of optimizing (18) with m = 200. 



5.6 Comparisons: Pdprox-dual vs. Pdprox-primal and Pdprox vs. Pegasos 

In this subsection, we verify (i) the Pdprox-primal is more efficient than Pdprox-dual when 
n S> d, and (ii) the proposed Pdprox method converges faster than Pegasos when A = 
0(n -1 /( 1+r )), r G (0,1]. For a fair comparison, we use £| norm for the regularizer since 
Pegasos is designed for ^ regularizer. We choose the covtype data set as testing bed, which 
contains n = 581012 examples, and each example is represented by a vector of d = 54 
dimension. 

Figure 7 shows the comparison of Pdprox-dual and Pdprox-primal on covtype data 
set for two loss functions, i.e. hinge loss and generalized hinge loss. The domains for the 
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Figure 8: Comparison of convergence speed of Pdprox vs. Pegasos on covtype data set. 



dual variables of these two loss functions are Q a = {a : a G [0,1]™} and Q a = {a = 
(a 1 , a 2 ) G [0, l] nx2 , a 1 + a 2 < 1}, respectively. We can see that the domain for generalized 
hinge loss is complex than that for hinge loss, and we observe that Pdprox-primal is more 
efficient than Pdprox-dual for generalized hinge loss, since the former one maintains one 
copy of dual variables and only needs to compute one projection of the dual variables to the 
corresponding domain while the latter one maintains two copies of the dual variables and 
requires two projections. However, the two algorithms perform almost the same on simple 
domain of dual variables for hinge loss. 

Figure 8 shows the comparison of Pdprox vs. Pegasos on Covtype data sets with three 
different levels of A = ra -0,5 , n -0 ' 8 , n . For Pdprox, we use the Pdprox-primal algorithm. 
Since Pegasos can perform stochastic updating, we also report the performance of Pegasos 
algorithm by sampling 1 example and 1% examples for computing the stochastic gradient. 
We run both algorithms 1000 seconds, report the objective versus time in the top panel 
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of Figure 8, and report the objective versus the number of iterations (achieved in 1000 
seconds) in the middle panel. In the bottom panel, we also report the objective versus 
the first 500 iterations to better understand the behavior of algorithms at the beginning 
of iterations. The results show that (i) Pdprox converges faster than Pegasos when A = 
0( n -l/(i+T)) )T = 0,0.25,1, which verifies our discussion in subsection 4.5; (ii) the objective 
of Pegasos always has a sharp jump at the beginning of iterations, which becomes severe 
when A is reduced. This is because at the beginning, the gradient of the primal variable 
is not informative compared to that in Pdprox which uses the updated dual variables to 
compute the gradient. Finally, we observe that the performance of stochastic Pegasos with 
sampling 1% examples becomes better than the performance of batch Pegasos when A is 
reduced to be less than n -0 ' 8 , however, stochastic Pegasos performs consistently worse than 
Pdprox. 

6. Conclusions 

In this paper, we study non-smooth optimization in machine learning where both the loss 
function and the regularizer are non-smooth. We develop an efficient gradient based method 
for a family of non-smooth optimization problems in which the dual form of the loss function 
can be expressed as a bilinear function in primal and dual variables. We show that the 
proposed algorithm achieves a convergence rate of 0(1/T), significantly more efficient than 
the other existing first order methods for non-smooth optimization. Our empirical studies 
demonstrate the efficiency of the proposed algorithm in comparison to the state-of-the- 
art first order methods for non-smooth optimization problems, and the effectiveness of the 
proposed algorithm for optimizing the problem with a sparse constraint on the dual variables 
in the presence of noisy labels. In future, we plan to adapt the proposed algorithm for 
stochastic updating and for distributed computing environments. 
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Appendix A. Derivation of constant c for (generalized) hinge loss 

As mentioned before, it is easy to derive the constant c in equations (6) and (7) for the non- 
smooth loss functions listed before under the assumption that ||x||2 < 1. As an example, 
here we derive the constant for hinge loss and generalized hinge loss. For other non-smooth 
loss functions, we can derive the value of c in a similar way. For hinge loss, L(w,q) in (5) 
is given by 



2009. 




i=i 
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and its partial gradients are 



G a (w,a) = -1 - -(x x yi, •■■ ,x n y n ) T w, 
n n 

G w (w, a) = --X(a o y), 

n 

where 1 denotes a vector of all ones, and o denotes the element-wise product. Then, 



n 



||G w (wi,ai) - G v/ (w 2 ,a 2 )\\l 



1 n 




=1 


1 

n 2 


n 

i=i 



< 



1 n 



a 



2\2 



n . 

2 »=1 
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which implies c = 1/n. For the example of generalized hinge loss, L(w, a) in (5) is 

1 " 

L(w,a;X,y) = - y~]a}(l - ayjW T Xj) + a 2 (l - y;w T x;), 



8=1 



where a = [a , or] € Q a = {a : a € [0, l] n , a + a < 1}, and its partial gradients are 
G a (w,a) = -[1, 1] - -[a(xiyi, • • • ,x n y n ) T w, (xiyi, • • • ,x n y n ) T w], 



71 



n 



G w (w, a) = --X(a(a ! o y) + a 2 o y), 

n 

where 1 denotes a vector of all ones, and o denotes the element-wise product. Then for any 
Wj , w 2 and cxi = (a 1 ' 1 , a 2 ' 1 ), a 2 = (a 1 ' 2 , a 2 ' 2 ) G Q a , given ||x|| 2 < 1, we have 



||G a (wi,ai) - G a (\v 2 ,a 2 )\\ 2 F 



||G w (wi,ai) - G w (w 2 ,a 2 )|l2 



■^(WiXjyj - w 2 Xiyi) < — - — ||wi-w 2 || 2 , 



i=l 



J? 



2a 



i=l 



1,1 1,2n2 , z 2,1 



2,2\2 



n 

i=l i=l 

2a 2 .. Il2 

< ai - «2 f, 

n 

which implies c = (a 2 + l)/n in equation (6) and c = (2a 2 )/n in equation (7). We can 
derive the value of c in (6) and (7) similarly for other non-smooth loss functions. 

Appendix B. Proof of Lemma 3 

Since 

G a (w, a; X, y) = a(X, y) + H(X, y) T w, 
G w /(w, a; X, y) = b(X, y) + H(X, y)a. 
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Then 

||G w (wi,ai;X,y) - G w (w 2 , a 2 ; X, y)||| < \\H(X, y) T (wi - w 2 )||| < c ll w i ~ w 2|||> 
||G w (wi,ai;X,y) - G w (w 2 ,a 2 ;X,y)||| < \\H(X,y)(a 1 - a 2 )\\l < c\\ai - a 2 |||, 

where we use the assumption ||.ff(X, y) || § = ll^(X, y) T H2 — c - 
Appendix C. Proof of Lemma 6 

In order to proof Lemma 6, we first present the following lemma with its proof. 

Lemma 12 Let Z be a convex compact set, and U C Z be convex and closed, z$ G Z, and 
7 > 0. Considering the points, 

z h = argmin-||z - (z - 7£)lll> 
zeu z 



zi = argmin-||z - (z - Ji])\\l, 

z£U I 



then for any z G U, we have 



jr] T (z h - z) < i||z - z || 2 - ^ H z ~ z ill2 + 7 2 ||? - V\\l ~ 2 



- Z0II2 + ll z ^ ~ z i\\l 



Equipped with above lemma, it is straightforward to prove Lemma 6. We note that the two 
updates in Lemma 5 are the same as the two updates in Lemma 12 if we make the following 
correspondences: 

U = Z = R d x Q a , z = ( W ) G U, 
( u 4 _i\ / wA f u t 

,o "Ia-J , *-UJ'" i "U 

_ / G w (u t _i,a t ) + AvA _ / G w (w t) a t ) + Av t 
V -G a (u ( -i,/3 t -i) /' V -G a (w t ,a t ) 

Then the inequality in Lemma 6 follows immediately the inequality in Lemma 12. 

Lemma 12 is a special case of Lemma 3.1 (Nemirovski, 2005) for Euclidean norm. A 
proof is provided here for completeness. 
Proof [of Lemma 12] Since 

z h = argmin^llz - (z - 76 III: 

z£U I 

zi = argrnin-||z - (z - 7^)111: 

z6C/ 2 

by the first order optimality condition, we have 

(z - z h ) T ( 7 £ - z + z h ) > 0, Vz G U, (19) 
(z-zi) T ( 7 7 ? -z + zi) > 0,VzG U. (20) 
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Applying (19) with z = zi and (20) with z = z^, we get 

~f(z h - zi) T £ < (z - z h ) T (z h - Zi), 
7(zi - z h ) T r] < (z - zi) T (zi - z h ). 

Summing up the two inequalities, we have 

7(zfc - zi) T (f - rj) < (zi - z h ) T (z h - zi) = -||zi - z h \\ 2 2 . 



Then 

- TlhW^-h - z l|b > ~l{ z h - z l) T (f - V) > ll z l - z h\\l- ( 21 ) 

We continue the proof as follows: 

X ,, ||2 1„ „2 

-||z-z || 2 - -||z-zi|| 2 

= ^ll z l|l2 - ^ll z o||| - ( z l - z 0) Tz + ( z - Zl) T (zi - Z ) 

=^ll z i|l2 - ^ll z o||2 - ( z i - z o) Tz o + ( z - zi) T (7r/ + zi - z ) - (z - zi) T 7?] 



>^ll z i|l2 - ^ll z o||| - ( z i - z o) Tz o - ( z - zi) T 7?/ 



glWll - dMll - ( z l - z 0) Tz - ( z /i - z l) T 7 ? ?+( z fe 



z) T 7r/, 



where the inequality follows from (20). 



1 

>- 
~2 
1 



I 

z l||| - ^ 1 1 z 1 1 2 - ( z l - z o) Tz O - ( z h - Zl) T 77/ 

z i\\l - ~ ( z i - z o) Tz o - ( z h - z i) T 7(?? - - ( z h - zi) T 7^ 

zi 111 - ^INolli ~ ( z i - z o) Tz o - ( z h - z i) T 7(?? - 



+ ( z i - z h) T (7£ - z o + z ft) - ( z i - z h) T ( z h - z Q ) 



z i\\l - ^Nlll - ( z i - z o) Tz o - ( z h - z i) T 7(?7 - ~ ( z i - z h) T { z h - z ) 

z l|l2 - ^\\ z o\\l - ( z h - z 0) Tz - ( z h - z l) T 7( ? / - - ( z l - z ft) Tz ?i 



1 II Il2 1 || ||2 / \T 
^ll^lb - ^l^lb ~ ( Z l ~ Z h) z h 



\ z h ~ z l\\l + ^\\ z h ~ z o||| 



+ 



II Il2 1 n ||2 / \T 
dl^lb - ^INIb ~ \ z h ~ z 0j z 



(z h - Z!) T 7(r/ 



l z /i - z i|bll?7 - fib 



>-{|| Z/l - Zl || 2 + ||z fe - zo|| 2 } - 7 2 lb? - f Hi, 

where the first inequality follows from (19), and the last inequality follows from (21). Com- 
bining the above results, we have 

7(z ft - z) T r] < i||z - zolU - ^||z - zi || I +J 2 \\v ~ fill - ^{\\ z h - z i||l + ll z h - z o|||}- 
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Appendix D. Proof of Lemma 10 

By introducing Lagrangian multiplier for constraint ^ a>iVi < p, we have the following 
min-max problem 



- a|| 2 + 77 a^, - 



max mm 

v ae[o,i] n 2 

The solution to a is «j = [cEj — ^*Wj][ 0j i]- By KKT condition, the optimal rf is equal to if 
^JS.;][o.ipi < p, otherwise we have 

-lf v i][0,l] v i ~ P = °- 

Since the left side of above equation is a monotonically decreasing function in 77*, we can 
compute rf by efficient bi-section search. 

Appendix E. Proof of Lemma 11 

Using the convex conjugate V*(rj) of V(z), the composite mapping can be written as 

min — 1| w — will + A max (r/||w|| — V*(rj)) . 
w 2 v 

The problem is equivalent to maximize the following function on 77, 

min — llw — w[|| + Anllwll ) — XV^M. 
w 2 J 

Let w(t?) denote the solution to the minimization problem. Then the optimal solution of rj 
satisfies 

A||w(7 / )||-AK , (^) = 0, 



i.e. 



w 



Vi{ri) = 0. 



It is easy to show that ||w(t/)|| is a non-increasing function in 77. Similarly, since V* (77) is a 
convex function, its negative gradient —V£(t)) is a non-increasing function. Therefore, we 
can compute the optimal 77 by bi-section search. 
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